Abstract


This  project  focused  on  exploratory  and  structural  inference  for  high  dimensional 
data.  Exploratory  techniques  focused  mainly  on  the  parallel  coordinate  technique  which 
was exploited on graphical computing platforms. An MS-DOS package was developed
using  this  and  other  dynamical  graphics  techniques.  Several  refinements  of  the  parallel 
coordinate  technique  were  made  including  scintillation,  parallel  coordinate  density  plots 
and  grand  tours  in  d-dimensions.  The  structural  inference  work  is  nonparametric  in 
character  and  focuses  on  high  dimensional  density  estimation  and  ridge  estimation. 
These ideas are still under development. Promising techniques involve
random tessellations, the trajectory method for finding ridges and the gradient density
plot.
1. Introduction.

Electronic  instrumentation  implies  an  ability  to  acquire  a  large  amount  of  high 
dimensional  data  very  rapidly.  While  such  capabilities  have  existed  for  some  time,  the 
emergence  of  cheap  RAM  in  the  1980s  has  given  us  the  ability  to  store  and  access  that 
data  in  an  active  computer  memory.  It  is  our  thesis  that  this  new  ability  represents  a 
challenge  for  statisticians  to  develop  methodology  that  is  substantially  different  in  kind 
from  previous  methodology.  Historically,  the  majority  of  existing  methodology  is 
focused  on  the  univariate,  i.i.d.  random  variable  model.  Even  in  the  circumstance  that 
a  multivariate  model  is  allowed,  it  is  usually  assumed  to  be  multivariate  normal.  A 
premise  of  our  research  has  also  been  that  while  arbitrary  sample  size  is  frequently 
assumed,  the  truth  of  the  matter  is  that  common  techniques  implicitly  assume  small  to 
moderate sample sizes. For example, a regression problem with 5 design variables and
1000 observations would represent no problem for traditional techniques. By contrast, a
regression problem with 40,000 design variables and 8 million observations would. The
reason is clear. In the former case the emphasis is on statistical efficiency, which
most current statistical methodology implicitly optimizes. In the latter case the
emphasis must clearly be on computational efficiency. The emphasis placed on
parsimony  in  many  contemporary  books  and  papers  is  a  further  reflection  of  the 
mentality  focused  on  small  to  moderate  sample  sizes.  Finally,  we  note  that  the  very 
fact  of  largeness  in  sample  size  implies  that  it  is  unlikely  we  would  see  i.i.d. 
homogeneity. 

The  premise  of  this  research  project  was  to  focus  on  the  new  perspective  required 
for  the  analysis  of  large,  high  dimensional  data  sets.  We  intended  to  initiate  an 
exploration  of  this  perspective.  Primary  focus  was  on  data  representation  and  data 
exploration  and  secondly  on  structural  inference.  The  tasks  related  to  these  two  foci 
are outlined below.

1. Develop techniques for visualizing traditional statistical notions such as clustering, correlation and symmetry in parallel coordinate displays.

2. Relate statistical measures to simple statistical features of the parallel coordinate display.

3. Explore hyperdimensional geometry in parallel coordinates.

4. Implement parallel coordinate diagrams on mini/micro computers.

5. Construct the grand tour in parallel coordinates.

6. Explore the combinatorial issues of parallel coordinates.

7. Develop a high speed algorithm for multidimensional density estimation based on the Dirichlet tessellation.

8. Develop the theoretical properties of this multidimensional density estimate.

9. Develop a formal definition of and estimation procedures for d-ridges.

10. Develop a parallel computing algorithm for multidimensional density estimation and d-ridge estimation. Develop the corresponding graphical tools. Develop the associated non-Gaussian statistical inference theory.

2.  Summary  of  Results 

a.  Parallel  Coordinate  Displays.  The  parallel  coordinate  multidimensional  data 
representation  has  been  studied  thoroughly.  The  parallel  coordinate  display  is  an 
excellent  tool  for  statistical  data  representation  as  it  is  shown  to  be  characterized  by 
projective transformations of E^n into E^2. This feature gives an elegant matrix
formulation  of  parallel  coordinate  diagrams  and  also  assures  that  the  duality  properties 
commonly  found  in  projective  geometry  hold  for  parallel  coordinate  displays.  As  an 
example,  quadratic  forms  map  into  quadratic  forms,  i.e.  conic  sections  map  into  conic 
sections.  One  particularly  useful  duality  is  that  ellipses  map  into  hyperbolas.  This  has 
a  very  useful  interpretation  for  data  which  has  a  joint  density  with  ellipsoidal  cross 
sections. Hyperdimensional ellipsoids are easily recognized in parallel coordinates. In
working with parallel coordinate displays, we have found implementations of painting
(brushing), scintillation, density plots and grand tours to be particularly useful.
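
The projective character of the display can be illustrated with the simplest duality, the correspondence between points and lines. The following Python sketch is not the project's software; the function names and the `spacing` parameter are hypothetical. It maps an observation to the vertices of its polyline and checks that observations lying on a line x2 = m*x1 + b (with m not equal to 1) give segments that all pass through one common point of the display.

```python
def to_polyline(point, spacing=1.0):
    """Vertices of the parallel coordinate polyline for one observation.

    Axis i is drawn as a vertical line at horizontal position i * spacing
    (an assumed layout), and coordinate i is plotted as a height on it.
    """
    return [(i * spacing, value) for i, value in enumerate(point)]


def segment_height(x1, x2, t, spacing=1.0):
    """Height at horizontal position t of the line through (0, x1) and (spacing, x2)."""
    return x1 + (t / spacing) * (x2 - x1)


def dual_point(m, b, spacing=1.0):
    """Display point shared by all segments of observations on x2 = m*x1 + b."""
    if m == 1:
        raise ValueError("slope 1 gives parallel segments with no finite common point")
    return (spacing / (1.0 - m), b / (1.0 - m))


if __name__ == "__main__":
    m, b = -1.0, 2.0
    t, y = dual_point(m, b)
    for x1 in (0.0, 1.0, 3.0):
        x2 = m * x1 + b
        print(to_polyline((x1, x2)), segment_height(x1, x2, t), "expected", y)
```

With negative slope the common point falls between the two axes, which is why negatively correlated variables show a characteristic crossing pattern.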

As  with  all  data  plots  having  a  large  number  of  observations,  there  is  heavy 
overplotting  in  parallel  coordinate  displays.  We  have  improved  the  situation 
substantially  by  using  different  colors  for  plotting  each  observation.  To  a  large  extent 
this  allows  us  to  track  a  single  observation  through  all  dimensions  of  a  parallel 
coordinate  display.  By  using  various  colored  paints  including  rainbow  paint  and 
invisible paint, we have been able to manipulate the displays very effectively. Solid
colored paint allows us to paint clusters and separate blocks of data in a natural way.
Invisible paint allows us to eliminate outliers and/or subclusters to examine the regular
structure  more  carefully.  Rainbow  paint  allows  us  to  return  to  multicolored  diagrams 
easily. 

When  a  diagram  is  rainbow  painted  and  there  is  still  a  considerable  amount  of 
overplotting,  we  have  found  scintillation  to  be  a  useful  tool.  The  idea  is  that  if  each 
observation is a different color and we simply swap colors sequentially at all points
where  there  is  overplotting,  we  obtain  a  dynamical  flashing  display  in  which  the  speed 
of  flashing  conveys  a  sense  of  how  many  points  are  overplotted.  The  dynamic  color 
display  also  allows  us  to  track  observations  where  there  would  be  ambiguity  due  to 
overplotting. 
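
A minimal sketch of the color-swapping idea follows, assuming a hypothetical data structure `pixel_observations` that lists, for each screen pixel, the observations drawn there. The report does not specify the timing scheme, so cycling one step per frame is an assumption. In this sketch a pixel's color cycle repeats after as many frames as there are observations overplotted at that pixel.

```python
def scintillation_frame(pixel_observations, frame):
    """Return, for each pixel, the observation whose color is shown in this frame.

    pixel_observations maps a pixel (col, row) to the list of observation
    indices overplotted there; both names are hypothetical.
    """
    shown = {}
    for pixel, observations in pixel_observations.items():
        # Step through the overplotted observations in sequence, one per frame.
        shown[pixel] = observations[frame % len(observations)]
    return shown


if __name__ == "__main__":
    pixels = {(0, 0): [3], (1, 0): [1, 4, 7]}
    for frame in range(4):
        print(frame, scintillation_frame(pixels, frame))
```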

An  alternate  technique  when  overplotting  is  even  heavier  is  to  construct  the 
parallel  coordinate  (line)  density  display.  The  idea  is  to  replace  the  parallel  coordinate 
diagram  with  a  density  version  of  it.  This  has  enormous  benefit  when  global  structure 
is obscured by overplotting even when the sample sizes are fairly small. We have
demonstrated the ability to detect a hole on the inside of a 4-dimensional hypersphere when
we have only sampled data. This is an excellent achievement because (1) the spherical
symmetry of a 4-sphere eliminates any preferential projection direction and (2) any three
or lower dimensional projection of a 4-sphere will not show any holes.
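
One simple way to form such a density version, sketched here under assumptions, is to count how many polylines pass through each cell of a grid laid over the display. This is not necessarily the estimator used in the project, which may smooth the counts. The function name, the grid resolution parameters and the rescaling of each axis to [0, 1] are all hypothetical choices.

```python
def line_density(data, bins=20, samples_per_segment=50):
    """Count polylines passing through each cell of a grid over the display.

    Returns a list of columns; grid[c][b] is the number of observations whose
    polyline passes through vertical bin b at horizontal position c.
    """
    d = len(data[0])
    lows = [min(row[j] for row in data) for j in range(d)]
    highs = [max(row[j] for row in data) for j in range(d)]

    def scale(value, j):
        # Rescale coordinate j to [0, 1]; a constant coordinate maps to 0.5.
        span = highs[j] - lows[j]
        return 0.5 if span == 0 else (value - lows[j]) / span

    columns = (d - 1) * samples_per_segment + 1
    grid = [[0] * bins for _ in range(columns)]
    for row in data:
        ys = [scale(row[j], j) for j in range(d)]
        for c in range(columns):
            j, k = divmod(c, samples_per_segment)
            if j == d - 1:
                y = ys[j]  # last column sits exactly on the final axis
            else:
                # Linear interpolation between axis j and axis j + 1.
                y = ys[j] + (k / samples_per_segment) * (ys[j + 1] - ys[j])
            b = min(int(y * bins), bins - 1)
            grid[c][b] += 1
    return grid


if __name__ == "__main__":
    crossing = [[0.0, 1.0], [1.0, 0.0]]
    for column in line_density(crossing, bins=2, samples_per_segment=4):
        print(column)
```

Each column of the grid sums to the number of observations, so the counts can be shown directly as gray levels or divided by the sample size.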

The  final  idea  in  connection  with  parallel  coordinates  that  we  have  explored  is 
the  grand  tour.  The  grand  tour  tries  to  capture  the  idea  of  looking  at  the  data  from  all 
possible  directions.  It  may  be  thought  of  as  a  general  affine  rotation  of  a  d-dimensional 
coordinate  system  in  d-space  and  then  plotting  data  points  expressed  in  the  rotated 
coordinate  system  in  a  screen  display  parallel  coordinate  system.  This  has  proven 
extraordinarily  effective  and  has  allowed  us  to  demonstrate  finding  linear  substructures 
of  6  dimensions  in  9-dimensional  data  and  also  to  find  7-dimensional  clusters  in  f 
dimensional  data. 
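
The rotation step can be sketched as follows. The report does not say how the sequence of rotations was generated, so this sketch assumes one simple construction: a product of rotations in every coordinate plane, each turning at its own angular speed. The function names and the `speeds` argument are hypothetical. Each rotated observation would then be drawn as an ordinary parallel coordinate polyline.

```python
import math


def identity(d):
    return [[1.0 if a == b else 0.0 for b in range(d)] for a in range(d)]


def planar_rotation(d, i, j, angle):
    """Rotation by `angle` in the (i, j) coordinate plane of d-space."""
    r = identity(d)
    c, s = math.cos(angle), math.sin(angle)
    r[i][i], r[j][j] = c, c
    r[i][j], r[j][i] = -s, s
    return r


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def tour_rotation(d, t, speeds):
    """Rotation at time t, built from one planar rotation per coordinate pair."""
    pairs = [(i, j) for i in range(d) for j in range(i + 1, d)]
    if len(speeds) != len(pairs):
        raise ValueError("need one angular speed per coordinate plane")
    q = identity(d)
    for (i, j), speed in zip(pairs, speeds):
        q = matmul(q, planar_rotation(d, i, j, speed * t))
    return q


def rotate(data, q):
    """Express each observation in the rotated coordinate system."""
    d = len(q)
    return [[sum(q[a][b] * row[b] for b in range(d)) for a in range(d)]
            for row in data]


if __name__ == "__main__":
    speeds = [1.0, math.sqrt(2.0), math.sqrt(3.0)]  # assumed, mutually irrational
    data = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
    for step in range(3):
        print(rotate(data, tour_rotation(3, 0.1 * step, speeds)))
```

Choosing angular speeds with irrational ratios is a common device for keeping the sequence from repeating; whether the project used it is not stated in the report.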

b.  Other  Results.  A  wide  variety  of  other  items  were  addressed.  Some  of  the 
tasks  have  been  addressed  but  are  not  yet  in  technical  report  form.  I  will  mention  some 
of these results briefly. We now have a formal definition of d-ridges and have several
methods  for  estimation  of  d-ridges.  One  of  the  most  interesting  is  a  direct  trajectory 
method  based  on  multidimensional  density  hypersurfaces.  A  two-dimensional 
illustration  is  given  in  Figure  1.  The  ridge  is  apparent  in  this  display.  The  location  of 
the  ridge  seems  fairly  robust  with  respect  to  the  smoothing  parameters.  In  order  to 
locate  the  ridge  more  precisely,  we  have  constructed  a  gradient  based  technique.  The 
estimated  gradient  is  shown  in  a  3-d  perspective  plot  in  Figure  2.  Figures  1  and  2  are 
based  on  the  same  data.  It  should  be  apparent  that  these  are  highly  nonlinear  data 
structures  and  we  have  quite  successful  estimates  of  them.  The  ideas  generalize  in  a 
straightforward  way  to  general  d-dimensional  space.  This  work  is  being  carried  out  with 
my Ph.D. student Qiang Luo.

Figure 1. Nonlinear curve (C-curve) estimated using the trajectory method based on a smoothed density plot.

Figure 2. 3-dimensional Perspective Plot of the Density Derivative of the Nonlinear Curve (C-curve).

Our work on density estimates based on random tessellations is also nearing completion
and will be reported on soon. That work is being carried out with my Ph.D. student
Leonard Hearne. All of the tasks listed above and in our 1987 proposal have been carried
out. A complete list of reports follows.